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Stochastic rearrangement of germline DNA by VDJ recombination is at the origin of immune 
system diversity. This process is implemented via a series of stochastic molecular events involving 
gene choices and random nucleotide insertions between, and deletions from, genes. We use large 
sequence repertoires of the variable CDR3 region of human CD4+ T-cell receptor beta chains to 
infer the statistical properties of these basic biochemical events. Since any given CDR3 sequence can 
be produced in multiple ways, the probability distribution of hidden recombination events cannot be 
inferred directly from the observed sequences; we therefore develop a maximum likelihood inference 
method to achieve this end. To separate the properties of the molecular rearrangement mechanism 
from the effects of selection, we focus on non-productive CDR3 sequences in T-cell DNA. We infer 
the joint distribution of the various generative events that occur when a new T-cell receptor gene 
is created. We find a rich picture of correlation (and absence thereof), providing insight into the 
molecular mechanisms involved. The generative event statistics are consistent between individuals, 
suggesting a universal biochemical process. Our distribution predicts the generation probability 
of any specific CDR3 sequence by the primitive recombination process, allowing us to quantify the 
potential diversity of the T-cell repertoire and to understand why some sequences are shared between 
individuals. We argue that the use of formal statistical inference methods, of the kind presented 
in this paper, will be essential for quantitative understanding of the generation and evolution of 
diversity in the adaptive immune system. 



I. INTRODUCTION 

Receptor proteins on the surfaces of B- and T-cells in 
the immune system interact with pathogens, recognize 
them and initiate an immune response. The diversity of 
these receptors is the outcome of a remarkable process in 
which germline DNA is edited to produce a repertoire 
of (T- or B-) cells with varied antigen receptor genes 
[I]. The process is called VDJ recombination because 
the germline contains multiple versions of so-called V-, 
D-, and J-genes, particular instances of which are quasi- 
randomly selected, stochastically edited and joined to- 
gether to produce a new surface receptor gene each time 
a new immune system cell is generated. 

The statistical distribution of these biochemical events 
(and the resulting receptor coding sequences) in a popu- 
lation of newly-created receptors is an important quan- 
tity: it contains information about the in vivo functioning 
of the biochemical editing mechanism and provides the 
baseline for a quantitative assessment of the downstream 
workings of selection in the adaptive immune system. 
Here, we address the problem of inferring this distribu- 
tion from the large T-cell sequence repertoires that are 
becoming available via high-throughput sequencing tech- 
nology [2H5]. In particular, we focus purely on a subset 
of receptor sequences that are non-productive, due to a 
reading frame shift or an accidental stop codon, to iso- 
late the statistics of the molecular mechanism from the 
effects of selection on the functional repertoires. 

In the beta chain of human T-cell receptors (the focus 
of this work), the germline has 48 different V-genes, 2 D- 



genes and 13 J-genes. VDJ recombination proceeds by 
first joining a D-gene with a J-gene, and then a V-gene 
with the DJ junction. First, the recombination activat- 
ing gene (RAG) protein complex brings two randomly 
chosen D- and J-genes together, cuts out the interven- 
ing chromosomal DNA, and forms a hairpin loop at the 
end of each gene [HI |7j . In further steps [HI H] the hair- 
pin loops are opened, creating overhangs at the end of 
both genes that may eventually survive as P-nucleotides 
(short inverted repeats of gene terminal sequence) |10) . 
This is followed by nucleotide deletions and insertions at 
the junctions and ends with ligation. The process is then 
repeated between a random V-gene and the D J junction. 
The end product is the so-called CDR3 region of the re- 
ceptor gene: a short, highly variable region that plays an 
essential role in determining the antigen specificity of the 
cell. 

Each recombined sequence can thus be thought of as 
the outcome of a generative event described by several 
random variables (Fig. [I]): V-, D-, and J-gene choices, 
deletions of variable numbers of nucleotides from the se- 
lected genes, insertions of random nucleotides between 
them, and the possible creation of P-nucleotides (short 
palindromic nucleotides as in Fig.]!^ at the 3' end of the 
D-gene). From the set of observed CDR3 sequences, we 
wish to infer the underlying probability distribution of 
these generative events. 

To date, this inference has been done via a determin- 
istic alignment procedure which assigns a unique event 
to each sequence [HE]- However, since individual CDR3 
sequences can arise in multiple ways (see Fig. [I]), this as- 
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signment must be done probabilistically. Deterministic 
alignment introduces spurious biases and correlations in 
the statistics of generative events (Fig. [2]). Thus, a sta- 
tistical inference procedure is needed to accurately infer 
the underlying event probability distribution from the 
data. In this paper we present such a method, based 
on likelihood maximization via an iterative expectation- 
maximization algorithm and apply it to recent data 
on human T-cell receptor sequences. 



II. ANALYSIS STRATEGY 

We work with sequence data on CD4+ T-cell beta 
chain CDR3 regions obtained from nine human subjects 
as described in [T2]. In these experiments, T-cells are 
collected from a blood sample, and sorted into 'naive' 
(CD45RO-) and 'memory' (CD45RO+) compartments, 
DNA is extracted, and sequence reads long enough to 
capture a 5 ' piece of the J gene, a 3 ' piece of the V gene 
and the variable sequence lying in between, are obtained. 
Each sequence is read multiple times, and a clustering al- 
gorithm is used to correct for sequencing error |12j . This 
process produces a data set consisting of an average of 
232,000 (140,000) unique CDR3 sequences from the naive 
(memory) compartments for each individual subject |26j . 
Each unique sequence comes with a multiplicity (ranging 
over three orders of magnitude) reflecting the prevalence 
of that particular cell type in the blood sample. 

Roughly 14% of the unique CDR3 sequences are 'non- 
productive' i.e. either their J genes have been shifted 
out of the correct reading frame or the CDR3 sequences 
have a premature stop codon. They arise from a recom- 
bination event on one of a cell's two chromosomes that 
failed to make a functional receptor, followed by a suc- 
cessful recombination on the other chromosome. Such 
sequences should not be subject to functional selection, 
and their statistics should reflect only the VDJ recom- 
bination process |27j . Since this is our primary concern, 
we focus our analysis on the non-productive CDR3 se- 
quences, of which there are an average of 35,000 (22,000) 
in the naive (memory) compartments for each individ- 
ual subject. We analyze the naive and memory data sets 
separately to be able to verify the absence of selection 
effects. 



A. Structure of recombination event distributions 

Each CDR3 generating recombination event can be 
fully characterized by a set E of discrete variables com- 
prising: the identities of the V-, D- and J-genes selected 
for recombination [55] (V,D,J); the numbers of bases 
deleted from the 3' end of the V-gene (delV), the 5' 
end of the J-gene (delJ), and both ends of the D-gene 
(del5'D and del3'D for the 5' and 3' ends, respectively); 
the number of palindromic nucleotides at each of the gene 
ends (palV, palJ, pal5 'D, pal3'£>); the specific sequence 



(x\, . . . , Xi ns vD) of length insVD inserted at the VD junc- 
tion, and the specific sequence, (yi, . . . , y- m sDj) of length 
ins-D J inserted at the DJ junction (see Fig.fl]). We choose 
a convention in which both sequences are read in the 5' 
to 3' direction, but the VD (DJ) inserted sequence is read 
from the sense (antisense) strand. 

We seek a joint distribution over all of these variables 
containing the minimal set of dependences between the 
variables that is required to self-consistently capture the 
observed correlations in the data. We find that the fol- 
lowing factorized form for the probability of a recombina- 
tion event E (defined by specific values for all the event 
variables) successfully captures all the significant corre- 
lations between sequence features that are present in the 
data (see Fig.[2|: 

-PrccombOE) =P(V)P(D, J)X 

P(delV|V) P(de\J\J) P(del5'D, dcl3'L>|L>)x 

insVD insDJ 

P(insVD) Y[ ^(xilxi-^PQnaDJ) J[ pj&(i«||fc-i). 

i=l i=l 

(1) 

The various factors are normalized joint or conditional 
distributions on their respective arguments. P(V) and 
P(D, J) account for the fact that the various genes 
have different usage probabilities (and that D- and J- 
gene usage is correlated). The factors P(delV|V"), etc., 
are distributions on the number of nucleotide deletions, 
conditioned on the gene being deleted (deletion profiles 
turn out to be very gene-dependent). P(insVD) and 
P(insD J) give the probabilities of different numbers of 
nucleotide insertions at each junction. The parameters 
Py]j and p^j account for possible nucleotide bias in the 
insertions: they give the conditional probabilities of in- 
serting a specific nucleotide given the identity of the im- 
mediately 5' nucleotide, with xq referring to the last nu- 
cleotide at the 3' end of the truncated V-gene on the sense 
strand for a VD insertion, or at the end of the truncated 
J-gene on the antisense strand for a DJ insertion. 

P-nucleotides do not appear explicitly in Eqn.[T] we 
treat them as 'negative' deletions (i.e. a palindrome of 
half-length 2, as in Fig.[TJ\, is counted as a deletion of 
value —2). This is possible because we find that when 
the number of nucleotide deletions is greater than zero, 
occurrences of palindromic nucleotides at the end of the 
gene segment are completely explained by chance in- 
sertions of the corresponding nucleotides (see Fig. 15 1. 
Thus, true P-nucleotides, not attributable to chance in- 
sertions, only occur in association with zero nucleotide 
deletions and it is consistent to label them as 'negative' 
deletions. 

The factors in our equation for P rcC omb(-£') (Eqn.JT]) are 
probability distributions on event variables that take on 
a finite number of values. Specifying this joint distribu- 
tion requires a total of 2865 probabilities (more than 90% 
of which are needed for the deletion length probabilities 
of the individual V-, D- and J-genes). Despite the large 
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FIG. 1: A 60bp CDR3 read (grey box) can be aligned to different genes (nomenclature follows IMGT conventions [13]) with 
different deletions (white), insertions (yellow), and P-nucleotides (red). (A) Alignment to specific V-, D-, and J-genes with 
insVD=13, insDJ=6, delV=5, delJ=6, del5'D=6, del3'D=— 2 (in other words, pal3'D=2). (B) Alignment of the same read 
to different V- and D-genes, and with insVD=15, insDJ=9,delV=7, del5'D=9, del3'D=3 (no P-nucleotides). Note that the 
alignment to the V-gene is not maximal in this case. A few heavily penalized mismatches are allowed (in the V-gene in this 
example) in order to accommodate a small sequencing error rate. The location of the sequencing primer is indicated: it is 
chosen to uniquely identify the start of the CDR3 read within each J-gene. 



number of probabilities to be inferred, we are able to 
determine them accurately and without overfitting. We 
emphasize that our goal is to obtain an accurate descrip- 
tion of recombination event statistics, and not (yet) to 
explain those statistics mechanistically. 

B. Generation probability and likelihood of 
observed sequences 

The probability P gcn (a) of generating a specific CDR3 
sequence a is the sum of the probabilities of all recombi- 
nation events E a that produce a: 

P g cn(<?) = P rccomb(£). (2) 

The likelihood L(a) of observing a specific CDR3 se- 
quence read a, however, must take into account residual 
sequencing error as well as allelic variation, and is given 
by a sum over a larger set of recombination events E a 
that generate sequences close to a: 

L{a) = P (^' cr ) where ( 3 ) 
P(E,a) = P Iecomh (E)x {1 + R)L 
x £ P(V a \V E )P(J a \J E )P(D a \D E )(-\ 

alleles a 

(4) 

In the latter equation, n orr is the number of mismatches 
between the observed read a and the CDR3 sequence a E 
that would be produced by the recombination event E 
with allele choices a. L is the length of the sequence read. 
The mismatch rate R is determined in the inference with 
the rest of the distribution parameters and reflects both 



sequencing error as well as unknown allelic variation. In 
practice, we only consider recombination events E a that 
lead to CDR3 sequences with at most a few mismatches 
from a. The sum over alleles [2§] arises because we do 
not know a priori which alleles are present and reads 
may not go deep enough into the gene sequence to clearly 
distinguish alleles from each other [14] . The probabilities 
of the different alleles, given a gene, are also inferred and 
are expected to differ from individual to individual. 

The likelihood of the whole data set V is then the prod- 
uct over the individual sequence likelihoods: C(T>) = 
llo-ex) L( a )- This expression depends implicitly on the 
parameters defining the generative probability distribu- 
tion (along with the allele distributions and the sequenc- 
ing error parameter), and we infer their correct values 
by maximizing C(T>) using an expectation maximization 
algorithm [TTJ rj5] (see Appendix [C] for details on the 
algorithm and its convergence). In order to identify uni- 
versal features of the diversity generation machinery, we 
perform this inference separately for each individual sub- 
ject. 

III. RESULTS 

In what follows, we present results of our analysis 
of naive, non-productive, CDR3 sequence repertoires of 
nine individuals (see Appendix [F] for a parallel analysis 
of memory sequence repertoires). 

A. Correlations between event variables 

It is important to verify that correlations not present 
in the assumed structure of the probability distribution 
(Eqn.JT]) are in fact not present in the data. To per- 
form this self-consistency check, we use the inferred gen- 
erative distribution to compute the probability-weighted 
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FIG. 2: (A) Data-derived correlations between sequence 
features: each entry is the mutual information I(X, Y) of a 
feature pair over the naive non-productive repertoire. The 
outlined elements are correlations expected from the form 
of PrccombiE): red identifies a direct effect of a factor in 
Eqn.JT] (e.g. D -o- J) and green indirect effects (e.g. D «-> 
J o delj). The top-left half of the matrix shows results from 
the maximum likelihood estimate (MLE), while the bottom- 
right half corresponds to a deterministic maximum-alignment 
based identification of recombination events. (B) Probability 
distribution of the number of VD insertions conditioned on 
the number of DJ insertions for MLE (top) and deterministic 
(bottom) analysis. Each curve corresponds to a different value 
of insDJ, ranging from (blue) to 10. The curves collapse for 
MLE indicating independence. 



B. Gene usage distributions 

The inferred frequencies of V- and J-genes vary signif- 
icantly from gene to gene, a phenomenon for which no 
mechanistic explanation has yet been given. In particu- 
lar, linear location on the chromosome does not explain 
the pattern of either V- or J-gene usage (see Fig.[9j\, C). 
The usage frequencies are consistent between individuals, 
though of all the inferred parameters in P rcC omb, these 
usage patterns show the most relative variation between 
individuals. 

The pattern of D-gene use conditioned on J-gene choice 
(Fig.[9j3) reveals the known mechanistic constraint pro- 
hibiting utilization of D-genes that lie 3 ' of the chosen 
J-gene [T] . The inferred distribution assigns a total prob- 
ability of less than 0.1% for joining events using TRBD2 
and any TRBJ1 gene. We note that such a determina- 
tion is impossible without probabilistic analysis due to 
the uncertainty in identifying genes in specific sequences. 
The dependence between V gene choice and D or J gene 
choice is very weak to non-existent (with mutual informa- 
tion less than 0.01 bits). Thus, we believe that previously 
reported correlations in the use of these genes [T5] reflect 
the effects of selection rather than VDJ recombination. 
Finally, we note the presence of pseudo V-genes which 
occur in almost 10% of the non-productive CDR3s (see 
Appendix [E] for more details). 



counts distribution of recombination event variables in 
the data, and then use this distribution to calculate 
the mutual information of all pairs of event variables. 
The matrix of mutual information values is shown in the 
upper-triangular part of Fig. [2)4., where the entries out- 
lined in red are dependences accounted for by individual 
factors in our assumed form of P Te comb(E) (Eqn.[T]), en- 
tries outlined in green are indirect dependences that can 
be induced by these factors, and the rest would vanish if 
the data were perfectly described by the assumed struc- 
ture of -Prccomb(-E)- There are a few detectable correla- 
tions that are not consistent with the assumed structure: 
{msVD,delV), (insDJ,delJ) and (V,D). They are, how- 
ever, all so weak (mutual information < 0.02 bits) that 
we do not model them explicitly (indeed, they might arise 
from subtle biases in our inference procedure). 

For comparison, in the lower-triangular part of Fig.[2|\ 
we show the mutual information values of all pairs of 
variables, but now calculated from a deterministic as- 
signment of events to sequences based on maximal align- 
ments. The resulting distributions exhibit spurious cor- 
relations that are absent from the corrected, maximum 
likelihood estimate (MLE) of the distributions. For in- 
stance, the number of insertions at the two junctions are 
found to be independent in our analysis while the uncor- 
rected estimate shows a dependence (Fig. [2)3, C). 



C. Nucleotide insertions 

In Fig. [3] we show the factors related to insertions in 
the inferred distribution P IC comb(E). The VD and DJ 
insertions are uncorrelated (Fig. [2]) and their length dis- 
tributions are nearly identical, with exponential tails 
(Fig. [3^). The nucleotide frequencies in the inserted seg- 
ments are not uniform and are well explained by a di- 
nucleotide Markov model where the probability of in- 
serting A, C, G, or T depends on the immediately 5' 
nucleotide (see Fig. [3^3). The VD inserted segment, on 
the sense strand, and the DJ inserted segment, on the an- 
tisense strand, show a preference for Cs. The frequencies 
of tri-nuclcotides are almost perfectly accounted for by 
the di-nucleotide preferences (Fig.|3p), suggesting that 
the sequence statistics are fully captured by dinucleotide 
statistics. Additionally, the VD insertion di-nucleotide 
bias, taken on the sense strand in the 5'-3' direction, is 
virtually identical to the DJ insertion di-nucleotide bias, 
taken on the antisense strand in the 5'-3' direction. This 
suggests that the mechanism of junctional nucleotide in- 
sertions is strand specific and occurs on opposite strands 
for the VD and DJ junctions. The molecular mechanistic 
basis of these features is not evident. 
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FIG. 3: Statistics of VD and DJ insertions. (A) Insertion 
length profiles: maximum likelihood estimate (deterministic 
estimate) displayed as solid (dashed) lines; error bars show 
variation across the nine individuals. The distribution tail 
is accurately exponential. The deterministic estimate greatly 
overestimates the frequency of zero insertions. Inset: mono- 
nucleotide utilization bias. (B) Dinucleotide utilization in in- 
sertions; the bias in DJ insertions is very accurately the re- 
verse complement of the VD insertion bias.( C) Higher-order 
nucleotide bias in VD (blue) and DJ (red) insertions is com- 
pletely accounted for by dinucleotide statistics. 
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FIG. 4: (A) Gene-specific deletion profiles for selected V 
(red) and J (green) genes: the profiles vary widely from gene 
to gene, but are nearly identical across individuals (all nine 
are plotted; one in grey from an individual with significantly 
smaller sample size). The blue curves in all panels show the 
predictions of a simple model for the sequence context depen- 
dence of deletion probabilities using a position weight matrix 
(PWM), fit to the V deletion profiles (see Appendix |L"| for de- 
tails). The model ignores P-nucleotide generation and lacks 
any effects of distance from the gene end but performs rea- 
sonably well (r 2 = 0.7). (B) Sequence logo of the context 
dependence of deletion probability, from the PWM fit to the 
V deletion profiles. Only positions 3 ' of the deletion site have 
strong effects on the probability. (C) Cumulative deletion 
profiles for V-genes and J-genes. Error bars indicate varia- 
tion across individuals. 



D. Nucleotide deletions 



Since there is a strong correlation between number of 
deletions and gene identity (see the entries for I(de\V, V) 
and I (del J, J) in Fig.[2|, we allow for gene-dependent 
deletion profiles in -P rccom b (E) (Eqn.JT]). The results for a 
few genes are shown in Fig.[4j\ (see Figs. [T7pO for all the 
profiles). P-nucleotides are counted as negative deletions 
as they occur only in association with zero nucleotide 
deletions (see Fig. 15 ). The profiles have substantial vari- 



ation from gene to gene, suggestive of a nuclease activity 
that depends on sequence context, but they are highly 
consistent between individuals. We have modeled this 
context dependence using a position weight matrix sum- 
ming independent contributions from the bases in a 6 nu- 
cleotide window (four 3 ' and two 5 ' ) around the cutting 
point to the log probability of deletion (see Fig.pj3 and 
Figs. [l7p0| for details). We find that only bases 3^of the 
deletion site have a strong effect on the probability, with 



T and A nucleotides having the greatest contribution, 
consistent with previous observations [17]. This simple 
model, which ignores both the P-nucleotides as well as 
the effects of distance from the end of the gene, does rea- 
sonably well in explaining the variation in deletion prob- 
abilities (r 2 = 0.7). This modeling is simply to suggest 
that the complexity of the observed deletion distributions 
may ultimately be explained by a parsimonious mecha- 
nistic model that reflects the underlying biochemistry of 
the deletion process. 



E. Consistency of distributions across individuals 

The insertion profiles, and the many different gene- 
dependent deletion profiles, are very consistent between 
individuals (Figs. [3j [I] and Figs. 17p0 l, suggesting the ac- 
tion of a universal molecular mechanism of rearrange- 
ment and providing convincing evidence against overfit- 
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ting. We note that finite sample size statistics accounts 
for less than 50% of the observed inter-individual vari- 
ance (indicated by the error bars) in some of our plots, 
possibly reflecting biological variation. 



F. Potential diversity of repertoire 

Our inferred distribution of recombination events 
(Eqn.Jlj) implies a probability distribution P gcn (er) on 
the space of all CDR3 sequences (Eqn.[2| whose entropy 
S scq = -S (T fgen(o-)logPgen(o-) is a measure of the po- 
tential sequence diversity of VDJ recombination. Since 
multiple recombination events can lead to the same se- 
quence, we cannot calculate S seq directly. We do, how- 
ever, have an explicit description of P rC comb, the entropy 
of which we can calculate: S Tecom b — 52 bits; in addition, 
we can show that sequence entropy and recombination 
event entropy are related by 



5', 



rccomb 



(S(E\a)) c 



47 bits . 



(5) 



where the correction term, (S(E\a)) a ~ 5 bits, is the 
entropy of recombination events that give the same se- 
quence, averaged over sequences. This means that CDR3 
sequences can be generated in ~ 32 different ways, on 
average, by VDJ recombination; this is the fundamen- 
tal reason why we must resort to probabilistic inference 
methods. The total sequence diversity of 47 bits cor- 
responds to a potential CDR3 repertoire size of ~ 10 14 
sequences [30J . This is to be compared with the esti- 
mated 4 x 10 6 unique CDR3 sequences in an individual 
PfTB] , the ~ 10 11 T-cells in the blood of an individual 
[HI] and the ~ 10 13 potential peptide-MHC complexes 
[20] . While convergent recombination means that the 
sequence entropy cannot be neatly partitioned into con- 
tributions from gene choice, deletions and insertions, the 
entropy of recombination events S recom b can be so par- 
titioned (Fig.[5^). We note that the bulk (60%) of the 
recombination entropy comes from the nucleotide inser- 
tions, and little from gene choice (5 bits from V and 4 
bits from D and J) consistent with previous estimates 
[21] . For comparison, uniform usage of the genes would 
result in an entropy of 5.9 bits for V and 4.7 bits for D 
and J gene choices. 



G. Overlap of repertoires between individuals 

Some sequences appear in the repertoires of more than 
one individual, and we can ask whether their number and 
specific identities are consistent with chance on the basis 
of our generative distribution P gcn (cr). We see evidence of 
inter-sample contamination in some of our data leading 
to a large number of shared sequences between specific 
individuals. Eliminating such questionable cases (see SI 
Appendix for details), we are left with 21 sequences that 
occur in the non-productive repertoires of two individuals 
and none that occur in more than two. 
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FIG. 5: (A) Entropy decomposition. Top bars: sequence 
entropy is smaller than recombination entropy by 5 bits be- 
cause of convergent recombination; Bottom bars: recombina- 
tion event entropy decomposed into contributions from gene 
choice, insertions, and deletions. (B) Statistics of the 21 
CDR3 sequences shared between pairs of individuals: actual 
(red) vs. expected on the basis of the inferred P gGn (cr) (blue). 
(C) Histogram of P g0 n(o") for all sequences (blue) and for the 
21 shared sequences (red, kernel density estimate); (-Pgen) for 
the full repertoire is indicated by the vertical green line. 



The total number of shared sequences between the 
repertoire samples of any pair of individuals with sam- 
ple sizes Ni and N2 is expected to be Poisson dis- 
tributed with mean n = AqiVi^-Pgen) cr where (P gC n)<T — 
^ CT Pg en (cr). Note that while the specific shared se- 
quences are likely to have high probabilities of genera- 
tion, the number of shared sequences, without regard to 
their identities, is determined by (P sen )(T which is the 
average value of P gon over the potential repertoire. We 
estimate this quantity to be (P gcn )a — 3.4 ± 0.1 x 10~ 10 , 
by taking the mean of P ge n over the observed repertoire. 

In Fig. [5)3, we compare the expected number of pairs 
of individuals with a certain number of shared sequences 
(calculated as a sum of Poisson distributions over the 
pairs) to the observed number of such pairs, showing ex- 
cellent agreement. The specific shared sequences have 
particularly high generation probabilities according to 
our distribution, with a median value of ~ 10 -8 com- 
pared to the repertoire median of ~ 10~ 14 (Fig.pfc). 
Since the generative distribution is trained on individual 
repertoires, and is highly consistent between individuals, 
its success in accounting for recurring sequences between 
individuals is a non-trivial test of its validity. We find 
similar results for the shared sequences among the mem- 
ory repertoires (see Fig. 11). 



Convergent recombination has been proposed as an 
explanation for the occurrence of 'public' TCRs [22T 
[24] . However, the recombination entropy S(E\a) is only 
weakly correlated with the generation probability P gon (c) 
(correlation coefficient 0.13, see Fig. 12), and we find 
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that the shared non-productive sequences in our data 
do not have higher recombination entropies than other 
sequences. 



H. Results from other repertoires 

Inference of Precomb(-E') from the non-productive mem- 
ory repertoires of the same nine individuals leads to re- 
sults identical with those reported above for the naive 
non-productive repertoires (see Fig. 10|12 ) . The con- 
sistency of the inferred generative distribution between 
these repertoires as well as between the nine individ- 
uals is strong evidence that the non-productive CDR3 
sequence statistics, memory or naive, reflect only the ba- 
sic recombination process and not selection. In Fig. |13| 
we show the distributions of generation probabilities of 
CDR3 sequences from the productive repertoires. While 
it is tempting to apply our algorithm to the productive 
sequence repertoires, it would be inconsistent to do so: 
these sequences have passed selection filters, thymic and 
adaptive, and we have no analog of Eqn.[l]to parametrize 
the probability of such success. This is an important sub- 
ject for future investigation. 



IV. DISCUSSION 

We have presented a method for inferring the statis- 
tics of VDJ recombination events from the large T-cell 
receptor sequence repertoires that are now available by 
high-throughput sequencing. We emphasize the crucial 
importance of using a probabilistic approach: the typical 
CDR3 sequence can be produced by about 32 different re- 
combination events, and using a deterministic assignment 
of events to each sequence results in systematic biases 
and spurious correlations. Our general approach allows 
us to cope with not-yet-indexed alleles [14] and, most im- 
portantly, with sequencing errors, an essential task given 
the rapid growth of high-throughput but error-prone se- 
quencing technologies. 

Since we focus on non-productive sequences, our re- 
sults describe the probability distribution over CDR3 se- 
quences produced by the recombination machinery be- 
fore any functional selection has occurred. Its remarkable 
reproducibility across individuals and repertoires (naive 
and memory) provides compelling evidence for the consis- 
tency and accuracy of our method. The obtained distri- 
bution is a central feature of the adaptive immune system 
and serves as a baseline (or, in evolutionary terms, a neu- 
tral model) for analyzing the subsequent processes of the 
immune system. By calculating the entropy of the gener- 
ative distribution, we can estimate the potential diversity 
of the CDR3 sequences (~ 10 14 sequences) and the con- 
tributions of insertions, deletions and gene choices to this 
entropy. We find that insertions contribute most (60%) 
of the diversity. 



We are able to evaluate the probability of generat- 
ing any specific CDR3 sequence (including as yet un- 
observed ones). This probability could be used to esti- 
mate the strength of selection on a sequence or group 
of sequences, or the likelihood that a sequence is shared 
between individuals or repertoires. Thus, it could help 
better characterize the significance of shared or 'public' 
TCR sequences [23] . We have verified that the sequences 
that are shared between the non-productive repertoires 
of different individuals in our data are consistent with 
the predictions of the inferred probability distribution 
(Fig.[5j3,C), a very stringent test of its accuracy. 

The recombination event distributions also provide in- 
sight into the molecular mechanism of recombination, 
and should serve as a starting point for detailed mecha- 
nistic models of recombination. We find that the recom- 
bination processes at the two junctions are essentially 
independent of each other, and that insertion events are 
independent of gene choice and deletions. The inferred 
distribution confirms that a D-gene can only recombine 
with downstream J-genes. We derive a precise model for 
the composition of inserted nucleotides, based solely on 
frequencies of di-nucleotides. We also show that a rela- 
tively crude model of sequence-specific nuclease activity 
can account for the deletion probabilities reasonably well. 
Our observed distribution, which is specified by a large 
number of probabilities, should be reproduced by parsi- 
monious, but more realistic, mechanistic models. 

We have focused on characterizing the molecular gener- 
ation of nucleotide sequences that code for T-cell recep- 
tors. The functional receptor repertoire is first shaped 
by this molecular process and then by thymic selection 
and adaptation to pathogens. Quantitative models of the 
latter processes are needed for understanding the adap- 
tive immune system. While the underlying biochem- 
istry conveniently served to parametrize our sequence 
distributions, finding an analogous functionally relevant 
parametrization of amino-acid sequences to model the 
effects of selection is much more challenging [35]. Statis- 
tical analysis of the productive receptor repertoires, with 
our precise characterization of the unselected repertoire 
in hand, will hopefully aid in this effort. 
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Appendix A: Sequences of V, D, and J-genes and 
their alleles 



Accurate knowledge of the sequences of germ line V-, 
D-, and J-genes and their allelic variants is essential to 
minimize errors and bias in our analysis. There are 2 D- 
genes, 13 J-genes, and 48 V-genes, not counting alleles. 
There are in addition 19 'pseudo' V-genes on the same 
germline chromosome: they participate in the recombi- 
nation process and, though they cannot lead to a func- 
tioning receptor, they can appear in the non-productive 
sequence data sets, provided that a sequencing primer 
(or an approximate one) is present, which in our case is 
true for 11 pseudo V-genes. 

We curated a list of known and discovered allelic vari- 
ants of the V-genes by combining those found in the pub- 
lic IMGT database [13] with variants that we discovered 
with high confidence during our analysis. Not all the 
sequence reads listed in IMGT are true variants since 
many of them are from rearranged DNA with variation 
at the junctional end. Such 'variants' were removed from 
our list, unless the variation was deeper in the sequence, 
far from the edited end. In addition, we have found 
three instances of allelic variants in our data that are not 
listed in IMGT. The discovered variants of genes TRBV7- 
7 and TRBV10-1 can actually be found by BLAST in 
the NCBI database of human sequences; the variant of 
gene TRBV7-2 is not found by BLAST and appears to 
be completely novel. Undiscovered variants have rather 
small impact on overall recombination event statistics, 
but they can cause systematic errors in the inference of 
gene-specific deletion profiles. 

Complete lists of the genes and alleles used in our anal- 
ysis are available online [31 . For completeness, we also 
list the primers used by Robins et. al. [5J|3] in acquiring 
the data we analyze. 



Appendix B: CDR3 sequence data files and formats 

The CDR3 sequences used in our analysis come from 
naive or memory CD4+ T-cells of 9 human individu- 
als, and are further segregated into 'in-frame' and 'non- 
productive' sequences. The sequences are 60bp in length 
for 6 of the subjects, and lOlbp in length for the re- 
maining three. The reads of different length differ only 
in how far the sequencing window goes into the V gene: 
both types are anchored on the same conserved pheny- 
lalanine in the J-gene and have the same read depth into 
the J-gene. 

Processed sequence data was made available to us by 
H. Robins. As described in [2] [3] each sequence is read 
multiple times and the multiple reads arc used to esti- 
mate the multiplicity of each specific TCR receptor in its 
respective compartment. In addition, multiple reads are 
used to correct for sequencing errors by clustering reads 
that differ at a small number of positions [3] . In our data 
files, the effective sequence multiplicity is recorded along 



with the error-corrected sequence (although we do not 
use multiplicity in our current analysis). The data files 
used in our analysis are available online [35]. The file 
names in the repository clearly indicate the category to 
which the included data belongs. 



Appendix C: Overall description of the analysis 
pipeline and software 



recombination scenarios 



probabilities 




FIG. 6: Flow chart of the analysis pipeline. 

There are two major steps in the analysis pipeline that 
leads from a list of CDR3 sequences to a final estimate 
of the probability distribution Precomb(-E') of generative 
recombination events. The first is an 'alignment' step in 
which, for each read a, we create a comprehensive list of 
recombination 'scenarios' {E a } that could plausibly have 
produced that read. A 'scenario' is a particular set of val- 
ues for the event variables (gene identities, VD insertions, 
etc.) that generates a recombined sequence nearly identi- 
cal to the read in question (with possibly a small number 
of mismatches). The second major step is an iterative 
procedure (summarized in the flow chart of Fig.[6| for 
finding the generative distribution that maximizes the 
likelihood of the observed data given the functional form 
of the generative distribution (as expressed in main text 
Eqn. 2). 

The algorithms we have developed to execute these two 
steps are described in greater detail in the following two 
subsections. Software to implement these procedures was 
written in Matlab using the Parallel Computing toolbox 
and run on a Linux cluster. Compiling key routines into 
C++ using Matlab Coder greatly improved processing 
speed, allowing model inference on an individual data set 
to be completed in about 20 hours running on 8 proces- 
sors. Our Matlab code, along with summary instructions 
on how to run it, is available online 1331 



1. Initial parsing of sequence reads by alignment 

The first step in our inference procedure is to align 
each CDR3 read with specific alleles of V, D, and J genes 
by sequence matching. The goal is to generate a set of 
plausible recombination events that could produce the 
read to serve as a starting point for subsequent prob- 
abilistic refinement. This preliminary alignment proce- 
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dure produces, for each read, a finite number of V, D, 
and J alleles, the maximal length alignments of these al- 
leles to the read, the corresponding minimum nucleotide 
deletions from the genomic sequences, with possible P- 
nucleotidcs identified, and with the unmatched parts of 
the read identified as VD or DJ insertions. Mismatch 
information is also stored. 

Certain thresholds are imposed on the alignments - 
gene alignment lengths must be sufficiently long; gene 
deletions must not be too large; errors are allowed in the 
alignments (no gaps), but the number of errors must be 
small. The alignment score (using an appropriate mis- 
match penalty) is used to rank order alignments, and a 
threshold on the score relative to the score of the best 
alignment is also imposed. Specific values for these vari- 
ous parameters are chosen in the light of computational 
experience to achieve fast and accurate convergence of 
the overall model-fitting algorithm. 

The procedure for finding J matches is simplest. The 
CDR3 reads all begin at the 3 ' end (sense strand) from 
a primer in a known position in each J gene. Thus for 
each candidate J gene, we simply look for exect matches 
of the end of the sequence read with the portion of the 
gene just 5 'of the primer. Proceeding in this way, and 
imposing the various thresholds mentioned, we find an 
average of 2-3 J alignments per read. 

For the V-gene, the position of alignment to the read 
is not fixed. So for a given V-gene, we align the 5' end of 
the read to the m-th base from the 3' end of the V-gene, 
and note the best-scoring match at this positioning (this 
time allowing some mismatches, and penalizing them in 
the score). We step through the values of m and record 
the best-scoring match over all positionings. Repeating 
this process for all the V-genes, and imposing the earlier 
mentioned thresholds, we are left with a limited set of 
possible V-gene identifications, together with their spe- 
cific alignments to the read. Proceeding in this way, we 
find an average of ~ 15 V alignments per read. 

After identifying the plausible alignments to V- and 
J- genes, we turn to the problem of identifying D-gene 
matches. This is a more difficult problem because the D- 
genes are short, and deletions (occurring on both ends) 
often leave residual sequences which are hard to identify 
as a D-gene fragment. We therefore put very loose con- 
straints on the D-gene alignments, relying on the proba- 
bilistic refinement to narrow them down. Specifically, we 
consider the read sequence segment lying between the end 
of the highest-scoring V-gene and the end of the highest- 
scoring J-gene, and include 10 nucleotides of flanking se- 
quence on either side, to allow for ambiguous origin of 
these bases. We identify as a possible D-gene match ev- 
ery maximal non-overlapping alignment to this segment 
of the three D-gene alleles. These D-gene matches are 
scored by their length and the top 200 are selected as 
possible D-gene alignments. 

Alignment files are available online [34]: the files are 
in Matlab format and record the outcome of the above 
alignment strategy for a subset of our data. Inspection of 



the alignment data for individual sequences should pro- 
vide instructive illustrations of the above-described pro- 
cedure. The various thresholds and parameters used in 
the procedure are found in the files as well. The full set 
of alignment files used in our analysis can be generated 
using routines provided in our online software repository. 

We note that one could generate a unique assignment 
of sequence features to a given read by selecting from the 
alignment ensembles just described the V, D, and J as- 
signments with the highest score (i.e. having the longest 
effective alignment with the read). We will call the oc- 
currence distribution of gene assignments, insertions, and 
deletions produced in this way as the 'deterministic ' es- 
timate of the sequence feature probability distribution. 
It corresponds to standard practice in the literature for 
inferring feature statistics from sequence data, and will 
be used as a benchmark for comparison and contrast with 
our more accurate probabilistically inferred distribution. 



2. The expectation maximization algorithm 

As described in the main text, we wish to find model 
parameters that maximize the likelihood of the data. We 
use an iterative Expectation-Maximization algorithm to 
do this. Given a current guess for the model parame- 
ters that describe -Prccomb(-E'), we update it by calculat- 
ing the probability-weighted counts of events over the 
data set and then using those counts to re-estimate the 
marginal distributions (P(V), P(D,J), P(insVD), and 
so on) that appear as factors in the general functional 
form of -Precomb (-E 1 ) (main text Eqn. 2). 

As indicated in main text Eqns. 2-4, the joint likelihood 
of a recombination event E and sequence a is the prod- 
uct of two factors: the probability of the generative event 
(given by P r0 comb(E)) , and the sum over allele choices a 
of the probability of those allele choices multiplied by 
the probability of the number of mismatches between 
a and the sequence o% implied by E and a. In other 
words, in addition to the recombination event probabil- 
ity P r ecomb(£'), likelihood involves the sequencing error 
rate R and the allele probabilities P(V a \V), etc. We em- 
phasize that we carry out this exercise independently for 
the data sets derived from different individuals. While we 
expect (and find) that -Precomb(-E') is consistent between 
individuals, we of course expect different individuals to 
have different allele probabilities. 

In the expectation maximization procedure, we start 
from a prior in which each factor in main text Eqn. 2 for 
-Precomb(-E0 is uniform in its variables, the sequencing er- 
ror rate R is set to a small value (typically 10 -4 ), and the 
allele probabilities are uniform over all the alleles of each 
gene. Using main text Eqn. 4, for each CDR3 sequence 
read a, we exhaustively compute the likelihoods of all 
recombination events E given cr, starting from maximal 
alignments for each sequence identified in the initial pars- 
ing of the read (previous section), and looping over the 
other scenarios, involving extra deletions compensated by 
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chance re-insertions of identical nucleotides, that could 
also 'explain' the read. We also loop over the number of 
true P-nucleotides in the cases where they are present. 

Normalizing these likelihoods yields the relative 
weights that observing the sequence a assigns to different 
recombination events E, given the current model parame- 
ters. Summing these weighted occurrences over all the se- 
quences in the data set gives a new, data-conditioned, es- 
timate of the various factors that enter into the assumed 
general form of Precomb(-E') (as well as a new estimate 
of the sequencing error probability and allele occurrence 
frequencies) . The formal statement of the update rule is 
as follows; for each parameter in the model that describes 
the probability of a specific recombination event feature 
X (say a particular V-gene choice) we update it to the 
probability weighted counts over the whole data set of 
that event. In other words, the (k + l)-th iteration of the 
model parameters are given by 

oev e 

= 1^1^ 5x ^ x F7iw3 ( C1 ) 



a-GT> E 



where Sx E ,x is one if X is true in the recombination 
event E and zero otherwise. This procedure is used to 
update all the factors entering into the likelihood calcu- 
lation and the process is repeated until convergence to a 
stable end point is achieved. Since all sequences in the 
data set are looped over in the calculation, we can record 
'on the fly' the likelihood L(a) (main text Eqn. 4), the 
generation probability P gcn (a) of that sequence (a con- 
ceptually different quantity), as well as the conditional 
entropy of events S(E\a) for each sequence quantifying 
the multiplicity of recombination events that could have 
produced the given CDR3 sequence). The product of 
L(a) over all sequences is the current overall likelihood 
of the data set, a measure of convergence of the proce- 
dure. The generation probabilities P gen (a) have a direct 
physical significance, reflecting the probability of gener- 
ation of the sequence by the molecular machinery. 



Iterating this process is guaranteed, by general expec- 
tation maximization arguments, to maximize the overall 
likelihood of the data set locally. We have found that 
rapid and direct convergence to a likelihood maximum 
is the norm for the data sets we work with (see Fig.[7|). 
The models for the probability distribution of generative 
events inferred in this way from the different data sets are 
available online |35| . The distribution is also described 
in a Microsoft Excel file. 



Appendix D: Sequencing error rate 

The sequence mismatch rate in our model reflects both 
uncorrected sequencing error as well as unknown allelic 
variation. Our model assumes that this mismatch rate R 
is independent of position along the sequence read. As 
is well-known, accuracy of the sequencing procedure be- 
comes worse at the end of the sequence read (the 5 ' , or 
V-gene, end of our CDR3 sequence) so, in assaying error 
rates, we ignore the last 15 nucleotides (at the 5 'end) 
for the 101 bp reads, where we can afford to do this. 
Our alignment procedure also disallows mismatches in 
the J- and D-gene alignment because of the shortness of 
these segments and the expected low error rate at this 
end (more accurately, the beginning) of the sequence 
read. In assessing position dependence of sequence er- 
ror rates, therefore, we only need concern ourselves with 
mismatches to V gene assignments. Summing all such 
mismatches for the three individuals for which we have 
101 bp reads, and plotting them against read position, 
we obtain the results plotted in Fig.[5J We find that R 
converges in the mean to a value of order 3 x 10~ 4 per 
base pair, two orders of magnitude smaller than the raw 
instrumental sequencing error rate. There are, however, 
a few sharp peaks at specific positions along the read; 
since they appear at the same position for different in- 
dividuals, they presumably reflect some anomaly in the 
functioning of the sequencing machine. This shortcom- 
ing of the error rate model does not greatly influence the 
results of the inference because the overall error rate is 
rather low. 
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FIG. 7: Convergence of the total likelihood of all data sets 
with iterations of the EM algorithm. 



Appendix E: Gene and pseudogene usage 

In Fig.[9j we show the inferred gene usage frequen- 
cies. As described in the main text, Fig. [9)3 reveals the 
mechanistic constraint prohibiting the recombination of 
the TRBD2 gene with any upstream TRBJ1 gene. We 
include pseudo V-genes in our analysis. These pseudo- 
genes cannot produce a functional receptor but they can 
participate in the recombination process and produce a 
non-productive rearranged CDR3 sequence which can be 
transmitted into the naive or memory compartments just 
like any other non-productive rearrangement. The set of 
V gene sequencing primers used by Robins et. al. pH|3] ei- 
ther exactly or approximately match 11 pseudogenes. Of 
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FIG. 8: Position-dependent error profile for the three indi- 
viduals with read length 101 base pairs. The sequencing read 
proceeds from the right (101 to 1) where the J gene sequencing 
primer binds. The spikes in the error rate at specific positions 
(67, 43 and 27) are true sequencing error spikes and not the 
result of unknown allelic variants. Positions 1-15 show the 
characteristic increase in error rate with read length. The 
overall decreased error rate in positions 10-20 reflect our re- 
quirement of a minimum alignment length of 20 nucleotides 
to a V gene with an upper bound on the allowed errors in the 
alignment. Since we do not allow any errors in the J and D 
genes, the error rate is zero in this region. 



Appendix G: Spurious shared sequences between 
repertoires 

Of the 9 individuals, we find three specific pairs of in- 
dividuals - (2,3), (2,7) and (5,6) - who have an unusually 
large number of sequences in common, in both the naive 
and memory compartments. While all other pairs of indi- 
viduals have between and 4 sequences in common, these 
three pairs have 15 to 90 shared sequences. Additionally, 
many of these shared sequences occur in both the naive 
and memory compartments of the individuals. We sus- 
pect that these anomalies are the result of inter-sample 
contamination. 

Hence, for our analysis of the distribution of shared 
sequences between individuals, we discard from consid- 
eration the four pairs of individuals (2,3), (2,7), (3,7) and 
(5,6). This leaves 32 pairs of individuals for our analy- 
sis. We also discard three specific additional sequences 
that occur in the naive and memory compartments of one 
individual and also in another individual. 



Appendix H: Convergent recombination and 
generation probability 



these, TRBV23-1, TRBV5-3, TRBV12-2 and TRBV6- 
7 show significant usage, together accounting for almost 
10% of CDR3 sequence reads. 



Appendix F: Memory T-cell non-productive 
repertoire 



We performed the same analysis on both the naive and 
memory T-cell repertoires. The non-productive CDR3 
sequences in both of these compartments should not be 
subject to selection, and a comparison of inferences from 
the two provides a test of this important assumption. Re- 
sults from the larger naive non-productive compartment 
(containing an average of 35,000 unique sequences per 
individual) were reported in the main text. Here we re- 
port the results from the smaller memory non-productive 
compartment (containing an average of 22,000 unique se- 
quences per individual) . In Fig. 10 we compare the naive 

In 



and memory insertions and deletions distributions. 
Fig.[Tl]we show that the occurence of shared sequences 
between the individual non-productive repertoires is con- 
sistent with our generative model for the memory com- 
partments as well. The plots show that the models in- 
ferred from the naive and memory T-cells are identical in 
all respects, in confirmation of the expectation that non- 
productive sequences are not subject to selection effects. 



As discussed in the main text, a typical CDR3 se- 
quence can be produced by « 32 different recombina- 
tion events, corresponding to an entropy of 5 bits per 
CDR3 sequence. In Fig. [12] we show the 2D histogram 
of the recombination entropy S(E\<t) and the generation 
probability P gen (a). As expected, sequences with higher 
recombination entropy tend to have higher total genera- 
tion probability, with a correlation of 0. 13. Note also that 
while the shared sequences between individuals (red dots) 
all have high P gcn (cr), they are widely distributed with 
respect to the recombination entropy, since only Pg Cn {cr) 
determines the recurrence probability of a sequence. 



Appendix I: Generation probabilities of productive 
sequences 

The probability distribution of recombination events 
that we infer enables us to calculate the generation prob- 
ability of any given TCR/3 CDR3 sequence. We calculate 
-Pgen(c) for all the sequences in the naive and memory 
productive repertoires. The distributions of these gener- 

The productive 



ation probabilities are shown in Fig. 13 



repertoires have systematically higher generation proba- 
bilities, implying that sequences that are more likely to 
be generated are also more likely to pass selection filters 
and survive in the blood. This is, in part, due to sys- 
tematically fewer insertions in the productive repertoires, 
which have exponentially higher generation probabilities. 
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A TRBV Gene Usage B TRBD Gene Usage 




J1-1 J1-2 J1-3 J1-4 J1-5 J1-6 J2-1 J2-2 J2-3 J2-4 J2-5 J2-6 J2-7 J1-1 J1-2 J1-3 J1-4 J1-5 J1-6 J2-1 J2-2 J2-3 J2-4 J2-5 J2-6 J2-7 



FIG. 9: Statistical aspects of gene usage. (A) Usage frequencies of V-genes, ordered by position on the chromosome, with the 
exception of pseudogenes (red legend). (B) Usage frequencies of the two D-genes. (C) Same for the 13 J-genes. (D) D-gene 
usage frequencies, conditioned on J-gene choice. As expected from the mechanistic constraint, TRBD2 has essentially zero 
probability ( < 0.1%) of recombining with any TRBJ1 gene. Error bars indicate variation across the nine individuals. 




FIG. 10: Comparison of insertions (A) and deletions (B) distributions for the naive and memory T-cell repertoires. We find 
that the inferred models from the two compartments are statistically identical in all respects. Error bars indicate variation 
across the nine individuals. 



Appendix J: Test of analysis on simulated sequences 



As noted in the main text, we infer the probability 
distribution of generative events from nonproductive se- 
quences only. One might worry that using such a non- 
random subset of all the sequences produced by VD J re- 
combination could introduce a bias in the inference. We 
first note that the condition for a rearranged CDR3 se- 
quence to be out of frame involves the sum of six variables 



that our analysis has shown to be uncorrelated: 

[ - delV + inaVD - del5'L> + length(D) 
- del3 'D + insD J - delJ] mod 3 > 0. 

Since a large number of uncorrelated variables are in- 
volved, it is a priori unlikely that this constraint would 
significantly influence the evaluation of the distributions 
that define our generative model. We can test this quan- 
titatively by generating a simulated sequence repertoire 
from our recombination event distribution, running our 
inference algorithm on the out-of-frame subset of these 
sequences, and then comparing the inferred and the "ac- 
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FIG. 11: Shared sequences in memory T-cell non-productive CDR3 sequence repertoires. A) Distribution of number of shared 
sequences between the 9 individuals. B) Distribution of P gcn (cr) for the entire repertoire (blue) and for the recurring sequences 
(red). (-Pgen) is indicated by the green vertical line. 



-8 
-10 
-12 
§ -14 

L 

° -16 

O 

-18 

-20 
-22 
-24 




123456789 
Entropy of recombination events given sequence, in bits 



400 

350 

300 

250 

200 

150 

100 

50 





FIG. 12: A 2D histogram of conditional entropy of recom- 
bination events given the sequence and P ge n(o"). Convergent 
recombination (as measureed by the recombination event en- 
tropy) is a contributing factor to P gcn (<j), with correlation 
coefficient 0.13. The shared sequences in the naive non- 
productive repertoires are shown in red. 
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FIG. 13: Generation probabilities of all the CDR3 sequences 
in the naive and memory productive repertoires were com- 
puted using our inferred distribution. The above panel shows 
the distribution of the logarithm of these probabilities for the 
three repertoires for one individual. The productive reper- 
toires have systematically higher generation probabilities. 



tual" event distributions. The result of this analysis on 
a simulated repertoire of 10 5 sequences (two-thirds of 
which were out-of-frame) is displayed in Fig. 



14 



It 



clear that the initial and the inferred generative distri- 
butions are identical to each other, confirming that the 
condition of being out-of-frame does not bias the statis- 
tics of recombination events and does not interfere with 
our ability to correctly infer the probability distribution 
of these events. 



Appendix K: Occurrence of palindromic nucleotides 
with non-zero deletions 



To show that the occurrence of palindromic nucleotides 
with non-zero nucleotide deletions from the ends of the 
genes is consistent with chance insertions, we keep track 
of the (model probability weighted) joint frequencies of 
lengths of observed palindromes conditioned on the num- 
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FIG. 14: Probabilities of recombination event variables were re-inferred by simulating sequences from our final distributions, 
discarding all in-frame sequences, and running the expectation-maximization algorithm on the out-of-frame subset. The above 
scatter plots show that the original probabilities are obtained. This provides evidence that the use of just the non-productive 
TCR sequences does not bias the statistics of recombination events. 



ber of deletions and on gene choice. Keeping track of this 
detail is necessary because of the strong dependence of 
deletion probabilities on gene choice. After we obtain our 
converged model, we calculate the frequencies of chance 
palindromic nucleotides of different lengths co-occurring 
with non-zero deletions (taking into account all the struc- 
ture of Precomb(£')7 including the nucleotide bias in in- 
sertions). The plot in Fig.[l5| shows that the observed 
frequencies of palindromic nucleotides co-occurring with 
non-zero deletions are completely consistent with those 
expected by chance insertions. 
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FIG. 16: Position weight matrix for sequence dependence of 
nucleotide deletion position. The figure shows e/log(10) (see 
text of Appendix L) fit to the V gene specific deletions profiles, 
using four nucleotides 3 ' and two nucleotides 5 ' of the deletion 
position (black vertical line). The 3' nucleotides are the most 
informative about deletion probability and show a preference 
for T and A. The sequence logo corresponding to this position 
weight matrix is shown in the main text Fig.|4j3. 



FIG. 15: Occurrence frequency of P-nucleotides for non-zero 
deletions. 
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FIG. 17: Deletion profiles for all the V-genes (1 of 3). The title for each panel lists the gene name and total number of counts, 
across all the individuals studied, of the particular gene in question. Individuals with fewer than 100 counts for a specific gene 
are plotted in gray dashed lines. The blue lines show the predictions of the position weight matrix based model fit to these 
curves. 



16 



0.4 
0.3 
0.2 
0.1 

o 



TRBV11-2 , N = 4318 



TRBV6-5 , N = 15007 



TRBV7-4 , N = 4603 



TRBV5-4 , N = 3760 





-4 -2 2 4 6 8 10 12 



-4 -2 2 4 6 8 10 12 




-4 -2 2 4 6 



4 -2 2 4 6 



TRBV6-6 , N = 2114 



TRBV5-5 , N = 2853 



TRBV7-6 , N = 6292 



TRBV5-6 , N = 6079 




-4 -2 2 4 6 8 10 12 

nj 
.a 
O 

qI TRBV6-8 , N = 1088 

0.4 



-4 -2 2 4 6 8 10 12 



TRBV7-7 , N = 5393 



-4 -2 2 4 6 8 10 12 



TRBV6-9 , N = 897 



-4 -2 2 4 6 8 10 12 



TRBV7-8 , N = 1950 




-4 -2 2 4 6 8 10 12 



-4 -2 2 4 6 8 10 12 



-4 -2 2 4 6 8 10 12 



-4 -2 2 4 6 



TRBV5-8 , N = 685 

CGACCACC 
■ 5* 



TRBV7-9 , N = 1 7340 



TRBV13 , N = 1421 



TRBV10-3 , N = 3557 



AACCGGTT 

3' 




-4 -2 2 4 6 8 10 12 



0.4 
0.3 
0.2 
0.1 

o 




-4 -2 2 4 6 8 10 12 



-4 -2 2 4 6 8 10 12 



T C AG 

3' : 


CT GAGT G A CT AC 

: 5' 






■ J 


i ■ . ' . \ 



-4 -2 2 4 6 



10 12 



Number of Deletions 



FIG. 18: Deletion profiles for all the V-genes (2 of 3). The title for each panel lists the gene name and total number of counts, 
across all the individuals studied, of the particular gene in question. Individuals with fewer than 100 counts for a specific gene 
are plotted in gray dashed lines. The blue lines show the predictions of the position weight matrix based model fit to these 
curves. 
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FIG. 19: Deletion profiles for all the V-genes (3 of 3). The title for each panel lists the gene name and total number of counts, 
across all the individuals studied, of the particular gene in question. Individuals with fewer than 100 counts for a specific gene 
are plotted in gray dashed lines. The blue lines show the predictions of the position weight matrix based model fit to these 
curves. 
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FIG. 20: Deletion profiles for all the J-genes. The title for each panel lists the gene name and total number of counts, across all 
the individuals studied, of the particular gene in question. Individuals with fewer than 100 counts for a specific gene are plotted 
in gray dashed lines. The blue lines show the predictions of the position weight matrix based model fit to the V deletions 
curves, but evaluated on the J gene sequences. 
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FIG. 21: Marginal deletion probability distributions for the two D-genes. Deletions at the 5 'end (3 'end) of the D gene are 
shown in green (blue). The x-axis displays the gene sequence from the 5 'end to the 3 'end. 



Appendix L: Sequence dependence of nucleotide 
deletion probabilities 
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Z{a) = ^2 ex P ^2 e ( fc ' - 4 + fc ) 



(L2) 



71 = 2 



\fc=l 



Since the sequence at the 3 ' end of the V gene varies 
between genes, we fit a simple model to the gene depen- 
dent deletions profiles to explain the variation in these 
distributions. The precise mechanism of the generation 
of P-nucleotides and their relationship to deletions is un- 
clear. Hence, we take only the probabilities of deletions 
greater than or equal to two nucleotides and consider the 
nucleotide sequence context (four bases 3 ' and two bases 
5 ' of the deletion position) as a predictor of the deletion 
probability. We use a function of the form 



P(n deletionsjcr & n > 2) 



ex P ( J2t=i e(k,a(n- 4 + k) 



Z(a) 



(LI) 



where e is a 6 x 4 matrix containing the contribution of 
each possible nucleotide at each of the positions, analo- 
gous to a (log) Position Weight Matrix (PWM) . We do a 
least squares fit to determine the elements of e. In Fig. [16} 
we show e fit to the V deletions. There is a strong prefer- 
ence for T and A, especially in the 2 nucleotides just 5 ' of 
the position of deletion. Since there are only 13 J-genes, 
there is less sequence variation among them that we can 
utilize. 
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We verify that the condition for being non-productive 
does not bias recombination event statistics by analyzing 
simulated sequences. See Appendix [J] 
Here we distinguish only the genes, not their various al- 
leles. The gene list includes germline pseudo-genes: they 
cannot produce functioning receptor proteins but, since 
we work with non-coding VDJ rearrangements, pseudo- 
gene sequences can appear in the data. 
We use the known alleles for each gene listed in the IMGT 
data base [13] augmented by a few additional variants 
observed in the data (see Appendix |A| for details). 
Recall that this estimate is for the /3-chain only. The a- 
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